PivotCompress: Compression by Sorting
Abstract
Sorted data is usually easier to compress than unsorted permutations of the same data. This motivates a simple compression scheme: specify the sorted permutation of the data along with a representation of the sorted data, compressed recursively. The sorted permutation can be specified by recording the decisions made by quicksort. If the size of the data is known, then the quicksort decisions describe the data at a rate that is nearly as efficient as the minimal prefix-free code for the distribution, which is bounded by the entropy of the distribution. This is possible even though the distribution is unknown ahead of time. Used in this way, quicksort acts as a universal code in that it is asymptotically optimal for any stationary source. The Shannon entropy is a lower bound when describing stochastic, independent symbols. However, it is possible to encode non-uniform, finite strings below the entropy of the sample distribution by also encoding symbol counts, because the values in the sequence are no longer independent once the counts are known. The key insight is that sparse quicksort comparison vectors can also be compressed, achieving an even lower rate when data is highly non-uniform while incurring only a modest penalty when data is random.

Introduction

Sorted data is usually easier to compress than unsorted permutations of the same data. Because sequential values are in order when data is sorted, their non-negative differences can be encoded in place of the original values. And because repeated values are all contiguous, they can be encoded by including a count along with the first instance. This motivates a simple compression scheme: first describe how to permute the data into sorted order, and then describe the sorted data. The permutation is invertible, so it can be used along with the description of the sorted data to generate the original data.

Because there are N! permutations of a sequence of N items, log2(N!) bits are required to specify an arbitrary permutation. But if there is any frequency regularity in the data, the sorted permutation can be specified using fewer bits, because there are fewer distinct permutations. One way to do so is to record the decisions made by the quicksort algorithm as it recursively partitions the data around pivots. In quicksort, each item in the sequence is compared to a pivot value and, based on that comparison, assigned to either the left or the right partition; the algorithm is then applied recursively to each partition. The recursion terminates when a sequence cannot be partitioned further, either because it is a single item or because all of its items are equal. When quicksort terminates, all occurrences of a given symbol are associated with a single leaf node, and the leaves are arranged in sorted order [CSRL01].

Even without the original data and pivot values, the sorted sequence can be regenerated by running the quicksort algorithm again, using the recorded decisions instead of comparing data to pivots. Applied to the index array X[i] = i, the recorded decisions transform it into a permutation vector X[i] = j, indicating that the value in position i is moved to position j. The inverse permutation vector, defined by Y[X[i]] = i, can be used to permute the sorted data back into the original order. An example quicksort partition tree is shown in Figure 1, and the corresponding decision bitvectors are depicted in Figure 2.

Assume the initial sequence has length N and contains M unique symbols, and let c_i denote the count of the i-th symbol. When the pivots are selected so that the partitions are maximally uniform, the leaf for the i-th symbol is reached after approximately log2(N/c_i) partitions. Even if the data cannot be partitioned uniformly, at most one extra partition is necessary.
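The decision-recording and replay scheme described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the pivot rule here (the smallest value above the minimum) is chosen for simplicity rather than for maximally uniform partitions, and equal-valued groups are split by position so that replay consumes exactly one bit per item at every partition.

```python
def record_quicksort(vals, bits):
    """Sort vals, appending one decision bit per item at each partition:
    0 = item goes to the left partition, 1 = to the right."""
    if len(vals) <= 1:
        return vals
    lo, hi = min(vals), max(vals)
    if lo == hi:
        # All items equal: split by position so the recursion terminates.
        # Any split is harmless because the values are indistinguishable.
        half = len(vals) // 2
        bits.extend(0 if i < half else 1 for i in range(len(vals)))
        left, right = vals[:half], vals[half:]
    else:
        # Simplistic pivot: the smallest value above the minimum guarantees
        # both partitions are non-empty (not the maximally uniform choice).
        pivot = min(v for v in vals if v > lo)
        bits.extend(0 if v < pivot else 1 for v in vals)
        left = [v for v in vals if v < pivot]
        right = [v for v in vals if v >= pivot]
    return record_quicksort(left, bits) + record_quicksort(right, bits)

def replay(indices, bits, pos=0):
    """Re-run the partitioning on an index array using the recorded bits
    instead of comparisons; returns (permutation, number of bits consumed)."""
    if len(indices) <= 1:
        return indices, pos
    left, right = [], []
    for idx in indices:
        (left if bits[pos] == 0 else right).append(idx)
        pos += 1
    left, pos = replay(left, bits, pos)
    right, pos = replay(right, bits, pos)
    return left + right, pos

data = [2, 3, 1, 3, 2, 1, 1]
bits = []
sorted_vals = record_quicksort(data, bits)
order, _ = replay(list(range(len(data))), bits)  # order[j] = original position of sorted item j
original = [None] * len(data)
for j, idx in enumerate(order):                  # invert the permutation
    original[idx] = sorted_vals[j]
assert original == data
```

Note that the decoder never sees the data or the pivots: the bit stream alone, together with the sequence length, determines the permutation, which is the property the scheme relies on.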
For each of the c_i instances of a given symbol, all of the comparisons and corresponding partition decisions must be recorded, so the total number of comparisons is approximately sum_i c_i * log2(N/c_i). Writing p_i = c_i/N, this can be rewritten as the familiar N * sum_i p_i * log2(1/p_i), showing that the average number of comparisons per item is approximately the entropy of the distribution.

[Figure: the maximally uniform quicksort tree]
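The counting argument above can be checked numerically. The sketch below (an illustration, not code from the paper) computes sum_i c_i * log2(N/c_i) for a sample and confirms it equals N times the empirical entropy:

```python
import math
from collections import Counter

def comparison_count(data):
    """Approximate number of recorded decisions with maximally uniform pivots:
    each of the c_i copies of symbol i passes through ~log2(N/c_i) partitions."""
    n = len(data)
    return sum(c * math.log2(n / c) for c in Counter(data).values())

def empirical_entropy(data):
    """Entropy in bits/symbol of the sample distribution p_i = c_i / N."""
    n = len(data)
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())

data = list("abracadabra")
# sum_i c_i * log2(N/c_i)  ==  N * sum_i p_i * log2(1/p_i)
assert abs(comparison_count(data) - len(data) * empirical_entropy(data)) < 1e-9
```

For uniform data over M symbols the count reduces to N * log2(M), while for highly skewed data it falls far below log2(N!), which is the point of recording quicksort decisions instead of an arbitrary permutation.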
Journal: CoRR
Volume: abs/1411.5127
Published: 2014